Draft: Real vectors for tests #980
base: dev
Conversation
✅ Deploy Preview for poetic-froyo-8baba7 ready!
📝 Walkthrough

The changes unify vector size constants across test modules by replacing hardcoded values with a single imported variable.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
🛑 Comments failed to post (37)
docs/source/qdrant_client.http.api.snapshots_api.rst (1)
4-7: 💡 Verification agent
❓ Verification inconclusive
Verify inclusion in the HTTP API index.
Ensure this new docs file is referenced in docs/source/qdrant_client.http.api.rst so that the snapshots API appears in the HTTP API toctree:
🏁 Script executed:

```shell
rg -n "snapshots_api.rst" docs/source/qdrant_client.http.api.rst
```

Length of output: 64
Add snapshots_api.rst to HTTP API toctree

The new docs/source/qdrant_client.http.api.snapshots_api.rst file isn’t referenced in the HTTP API index. Please update:

- File: docs/source/qdrant_client.http.api.rst
- Under the existing `.. toctree::` directive, add: `snapshots_api.rst`

docs/source/qdrant_client.http.api.cluster_api.rst (1)
4-7: 💡 Verification agent
❓ Verification inconclusive
Verify inclusion in the HTTP API index.
Make sure this file is added to docs/source/qdrant_client.http.api.rst so the cluster API is included:
🏁 Script executed:

```shell
rg -n "cluster_api.rst" docs/source/qdrant_client.http.api.rst
```

Length of output: 62
Add cluster_api.rst to HTTP API docs index

The cluster_api.rst page isn’t referenced in docs/source/qdrant_client.http.api.rst, so the cluster endpoints won’t be included in the generated documentation. Please update the toctree in that file:

- File needing update: docs/source/qdrant_client.http.api.rst
- Example patch:

```diff
 .. toctree::
    :maxdepth: 2

    collection_api.rst
+   cluster_api.rst
    # …other modules
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rst
# docs/source/qdrant_client.http.api.rst
.. toctree::
   :maxdepth: 2

   collection_api.rst
   cluster_api.rst
   # …other modules
```
docs/source/qdrant_client.fastembed_common.rst (1)
4-7: 💡 Verification agent
❓ Verification inconclusive
Verify inclusion in the main documentation.
Ensure this file appears in your toctree (e.g., in docs/source/qdrant_client.rst or docs/source/index.rst):
🏁 Script executed:

```shell
rg -n "fastembed_common.rst" docs/source/qdrant_client.rst docs/source/index.rst
```

Length of output: 80
Add fastembed_common.rst to the main toctree

The file docs/source/qdrant_client.fastembed_common.rst is not referenced in any toctree in docs/source/qdrant_client.rst or docs/source/index.rst.

- In docs/source/qdrant_client.rst (or docs/source/index.rst), insert:

```rst
.. toctree::
   :maxdepth: 2
   :caption: Qdrant Client Reference

   qdrant_client.fastembed_common
```

.github/workflows/integration-tests.yml (2)
60-68: 💡 Verification agent
🧩 Analysis chain
Ensure HF_TOKEN is correctly documented
The addition of the Hugging Face token suggests integration with embedding models, which aligns with the PR objective of using real vectors for tests. Ensure this requirement is documented for developers who will run these tests locally.
🏁 Script executed:

```shell
#!/bin/bash
# Check if HF_TOKEN usage is documented in the project
echo "Checking for HF_TOKEN documentation in README or docs..."
rg -i 'HF_TOKEN|hugging\s*face\s*token' --type md .
rg -i 'HF_TOKEN|hugging\s*face\s*token' --type rst .
rg -i 'HF_TOKEN|hugging\s*face\s*token' README.md || echo "No mention of HF_TOKEN in README.md"
echo "Checking for HF_TOKEN usage in test code..."
rg -i 'HF_TOKEN|hugging\s*face' --type py ./tests
```

Length of output: 513
Add HF_TOKEN setup documentation for local integration tests

The HF_TOKEN environment variable is now required to run the integration tests, but there’s no mention of this in the project docs. Please document how to obtain and set HF_TOKEN for local development and CI.

• In README.md (or under a new “Integration Tests” section), include:
- A note that HF_TOKEN must be set before running ./tests/integration-tests.sh
- Instructions or a link on how to generate an access token from Hugging Face (e.g., via https://huggingface.co/settings/tokens)
- Examples for setting the variable in shell (`export HF_TOKEN=…`) or in a .env file

• If you maintain separate contributor or developer docs, add the same information there.
40-51: 💡 Verification agent
🧩 Analysis chain
Verify if conditional tests should run on all Python versions
The async client and inspection cache tests are currently limited to Python 3.10.x. Consider whether these should also run on other Python versions to ensure consistent behavior across environments.
🏁 Script executed:

```shell
#!/bin/bash
# Check if there are any Python version-specific dependencies or features used in the async client
# or inspection cache that would justify limiting these tests to Python 3.10.x only
echo "Checking for Python 3.10-specific imports or features in async client code..."
rg -i 'python\s*3\.10' --type py ./qdrant_client
rg -i 'match\s+.*:' --type py ./qdrant_client  # Pattern matching (3.10 feature)
rg -i 'PEP.*?585' --type py ./qdrant_client
echo "Checking if test scripts have any Python version checks..."
cat ./tests/async-client-consistency-check.sh
cat ./tests/inspection-cache-consistency-check.sh
```

Length of output: 2715
I don’t see any Python-3.10-only constructs in the client code or test scripts, and the tooling scripts themselves invoke generic `python` (rather than pinning to 3.10). The consistency-check steps should work on all supported Python versions.

Update the workflow to run these tests on every Python version

```diff
 # .github/workflows/integration-tests.yml @@ lines 40-51
-      - name: Run async client generation tests
-        run: |
-          if [[ ${{ matrix.python-version }} == "3.10.x" ]]; then
-            ./tests/async-client-consistency-check.sh
-          fi
-        shell: bash
+      - name: Run async client generation tests
+        run: ./tests/async-client-consistency-check.sh
+        shell: bash

-      - name: Run inspection cache population tests
-        run: |
-          if [[ ${{ matrix.python-version }} == "3.10.x" ]]; then
-            ./tests/inspection-cache-consistency-check.sh
-          fi
-        shell: bash
+      - name: Run inspection cache population tests
+        run: ./tests/inspection-cache-consistency-check.sh
+        shell: bash
```

This ensures both checks execute across all Python versions in your CI matrix.
📝 Committable suggestion

```yaml
      - name: Run async client generation tests
        run: ./tests/async-client-consistency-check.sh
        shell: bash

      - name: Run inspection cache population tests
        run: ./tests/inspection-cache-consistency-check.sh
        shell: bash
```
qdrant_client/embed/models.py (1)
1-12: ⚠️ Potential issue

Add `from __future__ import annotations` for Python 3.8 compatibility

`list[...]` generics are evaluated at runtime and therefore raise `TypeError: 'type' object is not subscriptable` on Python 3.8. Unless the client has officially dropped 3.8 support, this will be a runtime blocker.

```diff
+from __future__ import annotations
 from typing import Union
```
Alternatively, fall back to `typing.List`/`typing.Dict`, but the future-import keeps the code cleaner.

Also applies to: 13-23
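A minimal, runnable sketch of what the future-import actually changes (the `sizes` function and its values are hypothetical, not from the client): under PEP 563, annotations are stored as plain strings and never evaluated at definition time, so the builtin-generic syntax no longer has to be subscriptable at runtime.

```python
from __future__ import annotations


def sizes() -> dict[str, list[int]]:  # builtin generics in the annotation
    return {"dense": [512], "sparse": [0]}


# With the future import, the annotation is kept as the literal source text;
# `dict[str, list[int]]` is never executed, so Python 3.8 can import this module.
assert sizes.__annotations__["return"] == "dict[str, list[int]]"
```

Note that libraries which later *evaluate* annotations (e.g. via `typing.get_type_hints`) would still need `typing.List`/`typing.Dict` on 3.8.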
qdrant_client/embed/type_inspector.py (1)
42-50: ⚠️ Potential issue

Avoid premature `return False` when iterating collections

`inspect()` exits on the first non-`BaseModel` element, potentially missing valid models later in the iterable.

```diff
-            else:
-                return False
+            else:
+                # Skip non-pydantic items and continue inspecting
+                continue
```

With this change, the loop continues scanning the remaining elements, ensuring no false negatives.
📝 Committable suggestion

```python
        elif isinstance(points, Iterable):
            for point in points:
                if isinstance(point, BaseModel):
                    self.parser.parse_model(point.__class__)
                    if self._inspect_model(point):
                        return True
                else:
                    # Skip non-pydantic items and continue inspecting
                    continue
        return False
```
qdrant_client/common/version_check.py (1)
12-18: 🛠️ Refactor suggestion

Add timeout & exception handling to avoid client hangs

`httpx.get` is executed without a timeout and outside a try/except block. A transient network issue will block indefinitely or raise and crash the caller.

```diff
-    response = httpx.get(rest_uri, headers=rest_headers, auth=auth_provider)
+    try:
+        response = httpx.get(
+            rest_uri,
+            headers=rest_headers,
+            auth=auth_provider,
+            timeout=5.0,  # seconds – tune if necessary
+        )
+    except httpx.HTTPError as exc:
+        logging.debug(f"Failed to fetch server version: {exc!r}")
+        return None
```

📝 Committable suggestion

```python
def get_server_version(
    rest_uri: str, rest_headers: dict[str, Any], auth_provider: Optional[BearerAuth]
) -> Optional[str]:
    try:
        response = httpx.get(
            rest_uri,
            headers=rest_headers,
            auth=auth_provider,
            timeout=5.0,  # seconds – tune if necessary
        )
    except httpx.HTTPError as exc:
        logging.debug(f"Failed to fetch server version: {exc!r}")
        return None
    if response.status_code == 200:
        version_info = response.json().get("version", None)
```
qdrant_client/hybrid/fusion.py (1)
7-11: 🛠️ Refactor suggestion

Rank index should be 1-based in Reciprocal Rank Fusion

Classical RRF uses `1 / (k + rank)` where `rank` starts at 1 (best item). Using the 0-based loop index over-weights every entry by `1`.

```diff
     def compute_score(pos: int) -> float:
         ranking_constant = (
             2  # the constant mitigates the impact of high rankings by outlier systems
         )
-        return 1 / (ranking_constant + pos)
+        return 1 / (ranking_constant + pos + 1)
```

📝 Committable suggestion

```python
    def compute_score(pos: int) -> float:
        ranking_constant = (
            2  # the constant mitigates the impact of high rankings by outlier systems
        )
        return 1 / (ranking_constant + pos + 1)
```
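To make the scoring concrete, here is a pure-Python sketch of Reciprocal Rank Fusion with 1-based ranks (this is an illustration of the formula, not the client's actual implementation; the document names and constant are illustrative):

```python
# Each source ranking contributes 1 / (k + rank) to a document's fused score,
# with rank 1 for the best hit in that ranking.
K = 2  # the ranking constant from the diff above


def rrf_score(rankings: list[list[str]], k: int = K) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for pos, doc in enumerate(ranking, start=1):  # 1-based rank
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + pos)
    return scores


dense = ["a", "b", "c"]
sparse = ["b", "a", "d"]
fused = rrf_score([dense, sparse])
# "a" and "b" each occupy ranks 1 and 2 across the two lists, so their
# fused scores are identical: 1/(2+1) + 1/(2+2).
assert abs(fused["a"] - fused["b"]) < 1e-12
```

With a 0-based index the denominators would shrink by one across the board, inflating every score, which is exactly what the suggested `pos + 1` fix corrects.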
qdrant_client/_pydantic_compat.py (2)
65-67: 🛠️ Refactor suggestion

Return a homogeneous type from `model_config`

For Pydantic v2 the function now returns a `ConfigDict`, not a plain `dict`, which breaks the declared return annotation and downstream `dict` operations (e.g. `model_config(...)["populate_by_name"]`). Wrap with `dict()` for v2 to keep behaviour identical across versions.

```diff
-        return model.model_config
+        return dict(model.model_config)
```

📝 Committable suggestion

```python
def model_config(model: Type[BaseModel]) -> dict[str, Any]:
    if PYDANTIC_V2:
        return dict(model.model_config)
```
19-20: ⚠️ Potential issue

Prevent possible `KeyError` in `to_jsonable_python` for Pydantic v1

`ENCODERS_BY_TYPE[type(x)]` will raise when `type(x)` has no registered encoder (e.g., `Decimal`). A safe fallback keeps the helper usable for arbitrary objects:

```diff
-    def to_jsonable_python(x: Any) -> Any:
-        return ENCODERS_BY_TYPE[type(x)](x)
+    def to_jsonable_python(x: Any) -> Any:
+        encoder = ENCODERS_BY_TYPE.get(type(x))
+        return encoder(x) if encoder else json.loads(json.dumps(x, default=str))
```

📝 Committable suggestion

```python
    def to_jsonable_python(x: Any) -> Any:
        encoder = ENCODERS_BY_TYPE.get(type(x))
        return encoder(x) if encoder else json.loads(json.dumps(x, default=str))
```
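The fallback pattern above can be demonstrated standalone. The `ENCODERS` table here is a hypothetical stand-in for pydantic v1's `ENCODERS_BY_TYPE`, with `Decimal` deliberately left unregistered to exercise the fallback path:

```python
import json
from datetime import date
from decimal import Decimal

# Hypothetical encoder registry: only `date` has an explicit encoder.
ENCODERS = {date: lambda d: d.isoformat()}


def to_jsonable(x):
    encoder = ENCODERS.get(type(x))
    # Unregistered types fall back to a str()-based JSON round-trip
    # instead of raising KeyError.
    return encoder(x) if encoder else json.loads(json.dumps(x, default=str))


assert to_jsonable(date(2024, 1, 2)) == "2024-01-02"
assert to_jsonable(Decimal("1.5")) == "1.5"  # no registered encoder needed
```

The `default=str` hook is what makes otherwise non-serializable objects like `Decimal` survive the round-trip.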
qdrant_client/http/api_client.py (2)
86-89: 🛠️ Refactor suggestion

`timeout` is copied but not removed from query params

After converting `"timeout"` to an integer you still leave `"timeout"` inside `params`, so the same value is sent twice: once as an HTTP query parameter and once as the HTTPX timeout. This can confuse the server and breaks caching. Remove it from `params` after extraction.

```diff
-        if "params" in kwargs and "timeout" in kwargs["params"]:
-            kwargs["timeout"] = int(kwargs["params"]["timeout"])
+        if "params" in kwargs and "timeout" in kwargs["params"]:
+            kwargs["timeout"] = int(kwargs["params"].pop("timeout"))
```

📝 Committable suggestion

```python
        url = urljoin((self.host or ""), url.format(**path_params))
        if "params" in kwargs and "timeout" in kwargs["params"]:
            kwargs["timeout"] = int(kwargs["params"].pop("timeout"))
        request = self._client.build_request(method, url, **kwargs)
```
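The `dict.pop` move-instead-of-copy pattern in isolation (the `kwargs` shape here is a simplified mock of the request keyword arguments, not the client's real ones):

```python
# The timeout is promoted to a client-level option and removed from the
# query string in a single step via dict.pop.
kwargs = {"params": {"timeout": "30", "wait": "true"}}

if "params" in kwargs and "timeout" in kwargs["params"]:
    kwargs["timeout"] = int(kwargs["params"].pop("timeout"))

assert kwargs["timeout"] == 30
assert "timeout" not in kwargs["params"]  # no longer sent as a query param
```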
109-120: ⚠️ Potential issue

Robust JSON parsing when handling 429 responses

Indexing `resp["status"]` will raise `KeyError` if the field is absent. Use `.get()` chaining to avoid masking the original error handling with an accidental `KeyError`.

```diff
-            resp = response.json()
-            message = resp["status"]["error"] if resp["status"] and resp["status"]["error"] else ""
+            resp = response.json()
+            message = (
+                resp.get("status", {}).get("error")  # graceful if key missing
+                or ""
+            )
```

Replicate the same fix in the async branch (lines 193-203).

📝 Committable suggestion

```python
        if response.status_code == 429:
            retry_after_s = response.headers.get("Retry-After", None)
            try:
                resp = response.json()
                message = (
                    resp.get("status", {}).get("error")  # graceful if key missing
                    or ""
                )
            except Exception:
                message = ""
            if retry_after_s:
                raise ResourceExhaustedResponse(message, retry_after_s)
        if response.status_code in [200, 201, 202]:
```
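One subtlety worth noting about the suggestion above: `resp.get("status", {})` still fails if `"status"` is present but `null`, because the default only applies when the key is missing. A slightly more defensive variant, sketched in plain Python (the `error_message` helper is illustrative, not client code):

```python
def error_message(resp: dict) -> str:
    # `or {}` covers both a missing "status" key and an explicit null value,
    # where `.get("status", {})` alone would return None and then fail.
    return (resp.get("status") or {}).get("error") or ""


assert error_message({}) == ""                                        # key absent
assert error_message({"status": None}) == ""                          # explicit null
assert error_message({"status": {"error": "rate limited"}}) == "rate limited"
```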
qdrant_client/http/api/aliases_api.py (1)
18-44: 💡 Verification agent

🧩 Analysis chain

Consider extracting common utility functions to a shared module

The JSON serialization utilities (`to_json` and `jsonable_encoder`) are duplicated across multiple API modules. Consider extracting these to a shared utility module to reduce duplication.

🏁 Script executed:

```shell
#!/bin/bash
# Check how many files contain duplicated JSON serialization utility functions
echo "Files with to_json function:"
rg "def to_json\(model: BaseModel" --type py qdrant_client/
echo "Files with jsonable_encoder function:"
rg "def jsonable_encoder\(" --type py qdrant_client/
```

Length of output: 1745
Extract common JSON serialization utilities to a shared module

The `to_json` and `jsonable_encoder` functions are duplicated in nine API modules. Centralize them into a single utility file to reduce maintenance overhead and ensure consistency.

Files containing the duplication:
- qdrant_client/http/api/aliases_api.py
- qdrant_client/http/api/beta_api.py
- qdrant_client/http/api/collections_api.py
- qdrant_client/http/api/distributed_api.py
- qdrant_client/http/api/indexes_api.py
- qdrant_client/http/api/points_api.py
- qdrant_client/http/api/search_api.py
- qdrant_client/http/api/service_api.py
- qdrant_client/http/api/snapshots_api.py

Suggested steps:

- Create a new file, e.g. `qdrant_client/http/api/utils.py`, containing the shared functions:

```python
# qdrant_client/http/api/utils.py
from typing import Any

from pydantic import BaseModel

from .version import PYDANTIC_V2


def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
    if PYDANTIC_V2:
        return model.model_dump_json(*args, **kwargs)
    return model.json(*args, **kwargs)


def jsonable_encoder(
    obj: Any,
    include=None,
    exclude=None,
    by_alias: bool = True,
    skip_defaults: bool = None,
    exclude_unset: bool = True,
    exclude_none: bool = True,
):
    if hasattr(obj, "json") or hasattr(obj, "model_dump_json"):
        return to_json(
            obj,
            include=include,
            exclude=exclude,
            by_alias=by_alias,
            exclude_unset=bool(exclude_unset or skip_defaults),
            exclude_none=exclude_none,
        )
    return obj
```

- In each API module, remove the local definitions and update imports:

```diff
-from pydantic import BaseModel
-from .version import PYDANTIC_V2
-
-def to_json(…): …
-def jsonable_encoder(…): …
+from .utils import to_json, jsonable_encoder
```

This refactor will eliminate duplication and centralize future changes.
qdrant_client/http/api/search_api.py (1)
7-8: 🛠️ Refactor suggestion

Avoid `import *` – switch to explicit imports

Star imports obfuscate the public surface, make static analysis harder, and are flagged by Ruff (F403/F405). Import the required symbols explicitly or keep the `models as m` alias only.

```diff
-from qdrant_client.http.models import *
-from qdrant_client.http.models import models as m
+from qdrant_client.http.models import models as m  # single, explicit entry-point
```
🧰 Tools

🪛 Ruff (0.8.2)

7-7: `from qdrant_client.http.models import *` used; unable to detect undefined names (F403)
qdrant_client/hybrid/formula.py (2)
109-116: ⚠️ Potential issue

`exponent.is_integer()` fails when `exponent` is `int`

`int` objects have no `is_integer()` method before Python 3.12, so when `exponent` is an `int` this raises `AttributeError`.

```diff
-    if base >= 0 or (base != 0 and exponent.is_integer()):
+    if base >= 0 or (base != 0 and (isinstance(exponent, int) or (isinstance(exponent, float) and exponent.is_integer()))):
```

Alternatively, cast to `float` and guard with `math.isfinite`.

📝 Committable suggestion

```python
    # Check for valid input
    if base >= 0 or (base != 0 and (isinstance(exponent, int) or (isinstance(exponent, float) and exponent.is_integer()))):
        try:
            return math.pow(base, exponent)
        except OverflowError:
            pass
    raise_non_finite_error(f"{base}^{exponent}")
```
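The version-safe predicate can be pulled out and tested on its own; this is a sketch of the guard used in the diff above (the helper name `is_integral` is illustrative):

```python
def is_integral(exponent) -> bool:
    # float.is_integer() exists on every supported Python version, while
    # int.is_integer() only appeared in 3.12 — so check the int case first.
    if isinstance(exponent, int):
        return True
    return isinstance(exponent, float) and exponent.is_integer()


assert is_integral(3)        # plain int
assert is_integral(3.0)      # integral float
assert not is_integral(2.5)  # fractional float
```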
348-358: ⚠️ Potential issue

Replace `assert False` in tests

`assert False` is stripped under optimisation (`python -O`). Use `pytest.raises` or raise `AssertionError` directly.

```diff
-        parse_variable("$score[invalid]")
-        assert False
+        with pytest.raises(ValueError):
+            parse_variable("$score[invalid]")
```

Repeat for the second occurrence.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.8.2)

350-350: Do not `assert False` (`python -O` removes these calls), raise `AssertionError()` (B011)

356-356: Do not `assert False` (`python -O` removes these calls), raise `AssertionError()` (B011)
pyproject.toml (1)
25-35: 🛠️ Refactor suggestion

Consider upper-bounding `protobuf` & `numpy` to avoid unexpected breaking majors

`protobuf = ">=3.20.0"` and `numpy` for non-3.9 ranges are currently un-bounded on the upper side. Both libraries regularly ship backwards-incompatible major releases (e.g., `protobuf 5.*`). A conservative constraint such as `<5` (protobuf) and `<2.2` (numpy) gives you time to test new majors before the ecosystem forces them in.

```diff
-protobuf = ">=3.20.0"
+protobuf = ">=3.20.0,<5"
```

📝 Committable suggestion

```toml
pydantic = ">=1.10.8,!=2.0.*,!=2.1.*,!=2.2.0"  # can't use poetry ">=1.10.8,<2.0 || >=2.2.1" since pip is complaining
grpcio = { version = ">=1.41.0", allow-prereleases = true }
protobuf = ">=3.20.0,<5"
urllib3 = ">=1.26.14,<3"
portalocker = "^2.7.0"
fastembed = [
    { version = "0.6.1", optional = true },
]
fastembed-gpu = [
    { version = "0.6.1", optional = true },
]
```
qdrant_client/connection.py (4)
98-103: ⚠️ Potential issue

stream-stream interceptor: remove unnecessary await

`grpc.aio.StreamStreamCall` isn’t awaitable. Replace with the pattern used above.

89-94: ⚠️ Potential issue

stream-unary interceptor suffers from identical pattern

Apply the same fix (`call = continuation(...)`) and adjust the return path.

70-76: ⚠️ Potential issue

Double-await bug: `intercept_unary_unary` breaks on real calls

`response` is awaited once here and converted to a plain protobuf message. `process_response` then tries to `await` it again, raising `TypeError: object ResponseMessage is not awaitable`.

```diff
-        response = await continuation(new_details, next_request)
-        return await postprocess(response) if postprocess else response
+        call = continuation(new_details, next_request)  # call is awaitable
+        if postprocess:
+            return await postprocess(call)
+        return await call
```

📝 Committable suggestion

```python
        new_details, new_request_iterator, postprocess = await self._fn(
            client_call_details, iter((request,)), False, False
        )
        next_request = next(new_request_iterator)
        call = continuation(new_details, next_request)  # call is awaitable
        if postprocess:
            return await postprocess(call)
        return await call
```

80-85: ⚠️ Potential issue

Same double-await issue for unary-stream

For streaming responses the call object is not awaitable at all, so the current `await continuation(...)` raises immediately.

```diff
-        response_it = await continuation(new_details, next(new_request_iterator))
-        return await postprocess(response_it) if postprocess else response_it
+        call = continuation(new_details, next(new_request_iterator))
+        return await postprocess(call) if postprocess else call
```

📝 Committable suggestion

```python
        new_details, new_request_iterator, postprocess = await self._fn(
            client_call_details, iter((request,)), False, True
        )
        call = continuation(new_details, next(new_request_iterator))
        return await postprocess(call) if postprocess else call
```
qdrant_client/conversions/common_types.py (1)
68-75: ⚠️ Potential issue

`list[...]` breaks Py < 3.9 compatibility

`list[PointId]` and similar PEP-585 generics only work starting with Python 3.9. The header comment (lines 143-145) says we still support Python 3.7, so importing this module on 3.7/3.8 will raise `TypeError: 'type' object is not subscriptable` before users can even reach the runtime version-check in their own code.

```diff
-PointsSelector = Union[
-    list[PointId],
+from typing import List  # at top of file
+
+PointsSelector = Union[
+    List[PointId],
```

Update every other occurrence of the builtin-generic syntax (`list[...]`, `dict[...]`) in this file (there are several) or drop 3.7/3.8 from the supported versions list.

Committable suggestion skipped: line range outside the PR's diff.
qdrant_client/http/api/distributed_api.py (2)
4-7: 🛠️ Refactor suggestion

Duplicate & wildcard imports clutter the namespace

Lines 4-7 import `BaseModel` twice and pull everything from `qdrant_client.http.models` via `*`. Apart from the Ruff warnings, this:

- masks the second `BaseModel` reference,
- makes static analysis and IDE completion much harder,
- risks name collisions.

```diff
-from pydantic import BaseModel
-from pydantic.main import BaseModel
-from qdrant_client.http.models import *
+from pydantic import BaseModel
+from qdrant_client.http import models as m
```

(The `as m` alias is already used below.) Please remove the redundant import and avoid `import *`.

📝 Committable suggestion

```python
from pydantic import BaseModel
from pydantic.version import VERSION as PYDANTIC_VERSION
from qdrant_client.http import models as m
```

🧰 Tools

🪛 Ruff (0.8.2)

5-5: Redefinition of unused `BaseModel` from line 4 (F811)

7-7: `from qdrant_client.http.models import *` used; unable to detect undefined names (F403)
18-23: ⚠️ Potential issue

`to_json` does not propagate `exclude_unset`/`none` to Pydantic v2 correctly

`model_dump_json()` takes `exclude_none` but not `exclude_unset`; passing an unexpected kwarg raises `TypeError`. Because `jsonable_encoder` always forwards `exclude_unset=True`, code paths running under Pydantic v2 will explode.

Consider splitting the kwargs explicitly:

```diff
-def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
-    if PYDANTIC_V2:
-        return model.model_dump_json(*args, **kwargs)
+def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
+    if PYDANTIC_V2:
+        kwargs.pop("exclude_unset", None)  # not supported
+        return model.model_dump_json(*args, **kwargs)
```

📝 Committable suggestion

```python
def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
    if PYDANTIC_V2:
        kwargs.pop("exclude_unset", None)  # not supported
        return model.model_dump_json(*args, **kwargs)
    else:
        return model.json(*args, **kwargs)
```
qdrant_client/embed/model_embedder.py (2)
32-35: ⚠️ Potential issue

`ModelEmbedderWorker.start()` causes `TypeError`

`start()` passes an unexpected `threads` keyword to `ModelEmbedderWorker.__init__`, which only accepts `batch_size` and `**kwargs`.

```diff
-        return cls(threads=1, batch_size=batch_size, **kwargs)
+        # ParallelWorkerPool already sets the number of processes; no need for "threads".
+        return cls(batch_size=batch_size, **kwargs)
```

Without this fix every attempt to spin up a worker process fails immediately.

📝 Committable suggestion

```python
    @classmethod
    def start(cls, batch_size: int, **kwargs: Any) -> "ModelEmbedderWorker":
        # ParallelWorkerPool already sets the number of processes; no need for "threads".
        return cls(batch_size=batch_size, **kwargs)
```
268-273: 🛠️ Refactor suggestion

Early `return` inside list traversal skips subsequent inference objects

If the first element of a list is not an inference object we `return None` prematurely, silently discarding the rest of the list.

```diff
-        if isinstance(data, list):
-            for value in data:
-                if not isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
-                    return None
-                self._accumulate(value)
+        if isinstance(data, list):
+            for value in data:
+                if isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
+                    self._accumulate(value)
+            return None
```

This ensures all items are inspected.

📝 Committable suggestion

```python
        if isinstance(data, list):
            for value in data:
                if isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
                    self._accumulate(value)
            return None
```
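The filter-instead-of-bail pattern can be shown in isolation; `InferenceObject` and `accumulate` here are hypothetical stand-ins for the client's inference types and `_accumulate` method:

```python
class InferenceObject:  # hypothetical stand-in for the client's inference types
    pass


def accumulate(data, sink: list) -> None:
    if isinstance(data, list):
        for value in data:
            # Skip non-matching items instead of returning on the first
            # mismatch, so later inference objects are still collected.
            if isinstance(value, InferenceObject):
                sink.append(value)


sink: list = []
accumulate(["not-an-object", InferenceObject(), InferenceObject()], sink)
assert len(sink) == 2  # objects after the mismatch are still collected
```

With the original early `return`, the leading string would have terminated the loop and `sink` would stay empty.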
qdrant_client/http/api/service_api.py (1)
4-7: 🛠️ Refactor suggestion
Remove the duplicate
BaseModel
import and avoidimport *
.
BaseModel
is imported twice (lines 4 & 5), triggering Ruff F811 and creating ambiguous symbols.
In addition, the wildcard import on line 7 hides the public surface and hurts IDE auto-completion.-from pydantic import BaseModel -from pydantic.main import BaseModel -from qdrant_client.http.models import * +from pydantic import BaseModel +from qdrant_client.http import models as m(The last line lets you access generated models through the existing
m
alias.)
This keeps the namespace clean and eliminates the redefinition warning.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.from pydantic import BaseModel from pydantic.version import VERSION as PYDANTIC_VERSION from qdrant_client.http import models as m
🧰 Tools
🪛 Ruff (0.8.2)
5-5: Redefinition of unused `BaseModel` from line 4 (F811)

7-7: `from qdrant_client.http.models import *` used; unable to detect undefined names (F403)
qdrant_client/http/api/collections_api.py (1)
4-7: 🛠️ Refactor suggestion
Same duplicate import & wildcard issue as in `service_api.py`.

Please apply the import-cleanup change here as well to silence F811/F403 and keep the public namespace tidy.
🧰 Tools
🪛 Ruff (0.8.2)
5-5: Redefinition of unused `BaseModel` from line 4 (F811)

7-7: `from qdrant_client.http.models import *` used; unable to detect undefined names (F403)
qdrant_client/grpc/collections_pb2.py (1)
50-52:
⚠️ Potential issueAvoid polluting the module namespace with
None
.
globals()['None'] = 0
creates a module attribute calledNone
and risks severe confusion:

```python
from qdrant_client.grpc.collections_pb2 import None  # SyntaxError: None is a keyword
```

The attribute cannot even be imported by its own name (the line above is a `SyntaxError`); it is only reachable via `getattr(collections_pb2, "None")`, yet it lurks in the module namespace, trips up tools that introspect the module, and visually masquerades as the real
None
, leading to hard-to-trace bugs.
If the enum generator must expose a “None” value, prefer a safer alias such asModifier_None = 0
.qdrant_client/embed/embedder.py (2)
278-286: 🛠️ Refactor suggestion
⚠️ Potential issue
**`**options or {}` is legal but easy to misread**

Inside a call, `**options or {}` parses as `**(options or {})` — the `**` operator accepts a full expression — so the file imports fine, but the precedence is easy to misread. Normalising the options once into a local makes the intent explicit:

```diff
-embedding_model_inst = self.get_or_init_model(model_name=model_name, **options or {})
+model_opts: dict[str, Any] = options or {}
+embedding_model_inst = self.get_or_init_model(model_name=model_name, **model_opts)
```

Consider extracting a small helper such as `_normalize_options(options)` to DRY this pattern across all `_embed_*` helpers.

Also applies to: 296-304, 327-338, 348-356, 365-373, 380-385
26-30:
⚠️ Potential issue

**Config keyword in the class header is fragile**

`arbitrary_types_allowed=True` in the `class` header is syntactically legal (class keywords are forwarded to the metaclass via `__init_subclass__`), and Pydantic v2's metaclass accepts it, but Pydantic v1 does not, and type checkers flag it — hence the `# type: ignore[call-arg]`. Moving the flag into a configuration block keeps the class header clean:

```diff
-class ModelInstance(BaseModel, Generic[T], arbitrary_types_allowed=True):  # type: ignore[call-arg]
-    model: T
-    options: dict[str, Any]
-    deprecated: bool = False
+class ModelInstance(Generic[T], BaseModel):
+    model: T
+    options: dict[str, Any]
+    deprecated: bool = False
+
+    model_config = {"arbitrary_types_allowed": True}  # Pydantic v2
```

(On Pydantic v1, use `class Config: arbitrary_types_allowed = True` instead — note that Pydantic v2 raises an error if both `Config` and `model_config` are defined on the same model, so pick one mechanism per supported major version.)
qdrant_client/http/api/snapshots_api.py (1)
2-7: 🛠️ Refactor suggestion
**Duplicate import & star-import hide real problems**

`BaseModel` is imported twice (lines 4 and 5). `from qdrant_client.http.models import *` makes type-checking noisy and is the reason `SnapshotPriority` appears "possibly undefined".

```diff
-from typing import IO, TYPE_CHECKING, Any, Dict, Set, TypeVar, Union
-from pydantic import BaseModel
-from pydantic.main import BaseModel
+from typing import IO, TYPE_CHECKING, Any, Dict, Set, TypeVar, Union
+from pydantic import BaseModel
```

Replace the star import with explicit names (or keep the `models as m` alias only) and import `SnapshotPriority` directly:

```diff
-from qdrant_client.http.models import *
-from qdrant_client.http.models import models as m
+from qdrant_client.http.models import models as m, SnapshotPriority
```

This eliminates re-definitions, improves IDE support, and resolves the Ruff warnings F403/F811/F405.

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.8.2)
5-5: Redefinition of unused `BaseModel` from line 4 (F811)

7-7: `from qdrant_client.http.models import *` used; unable to detect undefined names (F403)
qdrant_client/http/api/points_api.py (1)
2-8: 🛠️ Refactor suggestion
**Import hygiene: duplicated `BaseModel` and uncontrolled namespace**

Same issue as in `snapshots_api.py`:

- Redundant `BaseModel` import (lines 4 & 5)
- `from qdrant_client.http.models import *` obscures missing symbols such as `WriteOrdering`.

Adopt the same fix (keep a single `BaseModel` import and replace the star import with explicit names or the existing alias `m`). This will also silence Ruff F403/F811/F405.

🧰 Tools
🪛 Ruff (0.8.2)
5-5: Redefinition of unused `BaseModel` from line 4 (F811)

7-7: `from qdrant_client.http.models import *` used; unable to detect undefined names (F403)
qdrant_client/async_qdrant_fastembed.py (1)
398-425: 🛠️ Refactor suggestion
⚠️ Potential issue

**Avoid relying on `assert` for production-level validation**

`assert` statements are stripped when Python is executed with the `-O` (optimized) flag, which would silently bypass all the collection-compatibility checks. Converting these into explicit exceptions keeps the safeguards intact in all execution modes and allows you to surface a precise error type/message to the caller.

```diff
-assert (
-    embeddings_size == vector_params.size
-), f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+if embeddings_size != vector_params.size:
+    raise ValueError(
+        f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+    )
```

(…repeat for the remaining `assert` statements in this block…)

📝 Committable suggestion

```diff
 (embeddings_size, distance) = self._get_model_params(model_name=self.embedding_model_name)
 vector_field_name = self.get_vector_field_name()
 assert isinstance(
     collection_info.config.params.vectors, dict
 ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}"
 assert (
     vector_field_name in collection_info.config.params.vectors
 ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}, expected {vector_field_name}"
 vector_params = collection_info.config.params.vectors[vector_field_name]
-assert (
-    embeddings_size == vector_params.size
-), f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+if embeddings_size != vector_params.size:
+    raise ValueError(
+        f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+    )
 assert (
     distance == vector_params.distance
 ), f"Distance mismatch: {distance} != {vector_params.distance}"
 sparse_vector_field_name = self.get_sparse_vector_field_name()
 if sparse_vector_field_name is not None:
     assert (
         sparse_vector_field_name in collection_info.config.params.sparse_vectors
     ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}"
     if self.sparse_embedding_model_name in IDF_EMBEDDING_MODELS:
         modifier = collection_info.config.params.sparse_vectors[
             sparse_vector_field_name
         ].modifier
         assert (
             modifier == models.Modifier.IDF
         ), f"{self.sparse_embedding_model_name} requires modifier IDF, current modifier is {modifier}"
```
qdrant_client/async_client_base.py (1)
382-386: 🛠️ Refactor suggestion
**Inconsistent "sync" method in an async interface**

`upload_points` is defined as a synchronous method inside an otherwise asynchronous base class.
All real implementations that hit the network or disk will need an async API to avoid blocking the event loop.

Consider:

```diff
-    def upload_points(
+    async def upload_points(
         self, collection_name: str, points: Iterable[types.PointStruct], **kwargs: Any
     ) -> None:
```

and mark it `@abstractmethod` (see next comment).

📝 Committable suggestion

```python
    async def upload_points(
        self, collection_name: str, points: Iterable[types.PointStruct], **kwargs: Any
    ) -> None:
        raise NotImplementedError()
```
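The pattern argued for here can be sketched in isolation; the class names below are illustrative, not the actual qdrant-client classes:

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Iterable


class AsyncClientBase(ABC):
    @abstractmethod
    async def upload_points(
        self, collection_name: str, points: Iterable[Any], **kwargs: Any
    ) -> None:
        """Implementations may await network / disk I/O here."""
        raise NotImplementedError()


class InMemoryClient(AsyncClientBase):
    def __init__(self) -> None:
        self.store: dict[str, list[Any]] = {}

    async def upload_points(
        self, collection_name: str, points: Iterable[Any], **kwargs: Any
    ) -> None:
        await asyncio.sleep(0)  # yield control, as a real network call would
        self.store.setdefault(collection_name, []).extend(points)


async def main() -> None:
    client = InMemoryClient()
    await client.upload_points("test", [1, 2, 3])
    print(client.store["test"])  # → [1, 2, 3]


asyncio.run(main())
```

Because the method is `async`, a slow implementation suspends at its awaits instead of stalling every other coroutine sharing the event loop.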
qdrant_client/local/async_qdrant_local.py (1)
916-961:
⚠️ Potential issue

**Critical bug: inconsistent initial scroll limit**

In `create_collection()`, the first scroll uses `limit=2` while subsequent loops use `batch_size = 100`. This will only copy two points initially instead of the intended batch size. It should read:

```diff
-        (records, next_offset) = await self.scroll(
-            from_collection_name, limit=2, with_vectors=True
-        )
+        (records, next_offset) = await self.scroll(
+            from_collection_name, limit=batch_size, with_vectors=True
+        )
```

Please align the initial `limit` with `batch_size` to ensure consistent data copying.

📝 Committable suggestion

```python
        if src_collection and from_collection_name:
            batch_size = 100
            (records, next_offset) = await self.scroll(
                from_collection_name, limit=batch_size, with_vectors=True
            )
            self.upload_records(collection_name, records)
            while next_offset is not None:
                (records, next_offset) = await self.scroll(
                    from_collection_name, offset=next_offset, limit=batch_size, with_vectors=True
                )
                self.upload_records(collection_name, records)
```
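The scroll-with-offset pagination loop that the fix restores can be sketched independently of the client; the in-memory `scroll` below is a stand-in for the real API:

```python
from typing import Optional

RECORDS = list(range(250))  # toy dataset standing in for stored points


def scroll(offset: int = 0, limit: int = 100) -> tuple[list[int], Optional[int]]:
    """Return one batch plus the offset of the next batch (None when done)."""
    batch = RECORDS[offset : offset + limit]
    next_offset = offset + limit if offset + limit < len(RECORDS) else None
    return batch, next_offset


batch_size = 100
copied: list[int] = []

# First page uses the same batch_size as the loop below.
records, next_offset = scroll(limit=batch_size)
copied.extend(records)
while next_offset is not None:
    records, next_offset = scroll(offset=next_offset, limit=batch_size)
    copied.extend(records)

print(len(copied))  # → 250
```

With the original `limit=2` on the first call, the first iteration copies only two records, so the loop and the first page disagree about page boundaries even though no data is lost here; keeping both at `batch_size` makes the copy uniform.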
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/congruence_tests/test_discovery.py (1)
25-25
: Improved test consistency by using sample queries.Replacing random vector generation with predefined sample queries creates more consistent test behavior. However, the function still accepts a
dims
parameter that's no longer used.Consider updating the function signature to remove the unused parameter:
-def random_vector(dims: int) -> list[float]: +def random_vector(_: int = None) -> list[float]:Or better yet, rename the function to reflect its new behavior:
-def random_vector(dims: int) -> list[float]: +def sample_vector(_: int = None) -> list[float]:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tests/congruence_tests/test_collections.py (1 hunks)
- tests/congruence_tests/test_common.py (1 hunks)
- tests/congruence_tests/test_discovery.py (1 hunks)
- tests/congruence_tests/test_group_search.py (4 hunks)
- tests/congruence_tests/test_query.py (3 hunks)
- tests/congruence_tests/test_query_batch.py (3 hunks)
- tests/congruence_tests/test_recommendation.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/congruence_tests/test_recommendation.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Redirect rules - poetic-froyo-8baba7
- GitHub Check: Header rules - poetic-froyo-8baba7
- GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (17)
tests/congruence_tests/test_collections.py (1)
104-104
: Vector size standardization improves test consistency.

The vector size increase from 2 to 384 aligns with the standardized vector size used across the test suite. This change ensures all test vectors have consistent dimensions, creating more realistic test scenarios.
tests/congruence_tests/test_common.py (2)
17-19
: Standardized vector dimensions improve test realism.

Updating all dense vector size constants to 384 creates consistency across different vector types and makes tests more representative of real-world scenarios. This change supports the PR objective of using real vectors for testing.
22-24
: Standardized sparse vector dimensions improve consistency.

Unifying sparse vector dimensions to 384 (from previously varied sizes) ensures consistency between dense and sparse vector tests. This standardization better represents real-world usage patterns.
tests/congruence_tests/test_discovery.py (1)
19-19
: Added sample_queries import for using real vectors.

Good addition of the import for `sample_queries` from fixtures, supporting the move away from random vectors to predefined real vectors for testing.
tests/congruence_tests/test_query_batch.py (4)
22-26
: Improved imports for real vector testing.

Good addition of `sample_queries` import along with the existing random vector generation functions, supporting the transition to using real vectors for testing.
46-50
: Enhanced test realism with sample queries for text vectors.

Replacing random vector generation with predefined sample queries creates more consistent and realistic test inputs. This systematic change to how text queries and prefetch queries are constructed aligns well with the PR objective.
58-59
: Standardized approach to image and code vector queries.

Good consistency in applying the same pattern of using sample queries for both image and code vector queries. This ensures all vector types benefit from the improved test realism.
Also applies to: 66-67
107-114
: Improved DBSF fusion query tests with real vectors.

Using sample queries for the DBSF fusion prefetch queries completes the systematic replacement of random vectors, ensuring all test scenarios use realistic vectors.
tests/congruence_tests/test_group_search.py (5)
21-21
: Good addition of the sample_queries import.

This change aligns with the PR's objective to use real vectors for testing, replacing random vectors with predefined samples.
30-33
: Excellent standardization of test vectors.

Replacing random vector generation with predefined sample queries improves test determinism and consistency. This change ensures that tests use the same vector data across runs, making test results more reproducible.
222-223
: Good standardization of vector sizes.

Using the `text_vector_size` constant instead of hardcoded values makes the code more maintainable and consistent with other tests. This change supports the standardization of vector sizes across the test suite.
233-239
: Good update to query vector initialization.

Using `sample_queries()` instead of random vectors improves test consistency. The wrapping of the query vector in a NumPy array is maintained for the first test case, which preserves the existing test coverage for different input types.
241-243
: Simplified function call by passing vector directly.

Passing the vector directly without unnecessary conversion improves code clarity. This change is consistent with the goal of standardizing how vectors are handled in tests.
tests/congruence_tests/test_query.py (4)
36-36
: Good addition of the sample_queries import.

This change aligns with the PR's objective to standardize test vectors across the test suite.
50-58
: Excellent standardization of query vector initialization.

Using predefined sample queries from `sample_queries()` instead of random vectors ensures test consistency and reproducibility. The modification to `dense_vector_query_text_bis` on line 56 (creating a slightly different vector by adding 42.0) is preserved, maintaining the existing test behavior while benefiting from standardized test data.
: Good standardization of vector configuration.Using the
text_vector_size
constant for vector configuration instead of hardcoded values improves consistency and maintainability. This change supports the standardization of vector sizes across the test suite.
1469-1469
: Good update to fixture generation parameters.Using the
text_vector_size
parameter in thegenerate_fixtures
call ensures consistency with the vector configuration and other test files. This change further supports the standardization effort.
Actionable comments posted: 4
🧹 Nitpick comments (1)
tests/fixtures/points.py (1)
141-142
: Consider an edge case for the `skip_vectors` functionality

When using preloaded vectors and the `skip_vectors` option is enabled, there's no special handling for the case where vectors are skipped based on random chance (as seen in lines 143-146).

Consider how the `skip_vectors` functionality should work with preloaded vectors. If a vector is skipped, the index-to-vector mapping might become inconsistent. You might want to add a comment explaining this behavior or adjust the code to handle this edge case properly.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/fixtures/points.py
(5 hunks)
🧰 Additional context used
🪛 GitHub Actions: Integration tests
tests/fixtures/points.py
[error] 13-13: FileNotFoundError: No such file or directory: 'data/text.npy'. The test failed because the required data file 'data/text.npy' is missing.
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Redirect rules - poetic-froyo-8baba7
- GitHub Check: Header rules - poetic-froyo-8baba7
- GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (4)
tests/fixtures/points.py (4)
26-42
: Verify vector dimensionality matchesThe function
random_vectors
now uses preloaded vectors whenidx
is provided. However, there's no validation to ensure that the dimensionality of the preloaded vectors matches thevector_sizes
parameter.Add validation to ensure that the dimensions of the preloaded vectors match the expected dimensions:
```diff
 def random_vectors(vector_sizes: Union[dict[str, int], int], idx=None) -> models.VectorStruct:
     if isinstance(vector_sizes, int):
         if idx:
+            # Verify vector dimensions match
+            if len(_text_vectors_clean[idx]) != vector_sizes:
+                raise ValueError(f"Preloaded vector dimension {len(_text_vectors_clean[idx])} does not match requested dimension {vector_sizes}")
             return _text_vectors_clean[idx]
         else:
             return np.random.random(vector_sizes).round(3).tolist()
```

This will help avoid subtle bugs when the dimensions don't match.
46-48
: Handle potential index out of range errorsThe
random_multivectors
function now usessample_random_multivector
but there's no check to ensure thatvec_count
is not larger than the total number of available vectors in_text_vectors_clean
.Add validation to ensure that
vec_count
doesn't exceed the number of available vectors:def random_multivectors(vector_sizes: Union[dict[str, int], int]) -> models.VectorStruct: if isinstance(vector_sizes, int): vec_count = random.randint(1, 10) + vec_count = min(vec_count, len(_text_vectors_clean)) return sample_random_multivector(vector_sizes, vec_count)
This will prevent index out of range errors.
51-53
: Apply the same validation for multi-vector dictionary caseSimilar to the previous comment, the multi-vector dictionary case needs the same validation to prevent potential index out of range errors.
Add validation for the dictionary case as well:
for vector_name, vector_size in vector_sizes.items(): vec_count = random.randint(1, 10) + vec_count = min(vec_count, len(_text_vectors_clean)) vectors[vector_name] = sample_random_multivector(vector_size, vec_count)
This ensures consistency between the two code paths.
124-125
: Add bounds check for number of pointsThe
generate_points
function samples indices without checking ifnum_points
exceeds the number of available vectors.Add a validation to ensure
num_points
doesn't exceed the available vectors:+ if num_points > len(_text_vectors_clean): + raise ValueError(f"Requested {num_points} points, but only {len(_text_vectors_clean)} vectors are available") sampled_vectors = np.random.choice(len(_text_vectors_clean), size=num_points, replace=False)This prevents potential runtime errors when requesting more points than available vectors.
tests/fixtures/points.py
Outdated
```python
_text_vectors = np.load("data/text.npy", mmap_mode="r")
_text_vectors_unique = np.unique(_text_vectors, axis=0)
_text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)].tolist()
```
🛠️ Refactor suggestion
Add fallback mechanism for missing data files
Currently, if the data files are missing, the test will fail completely. Consider adding a fallback mechanism that uses randomly generated vectors when the data files are not found.
Implement a try-except block to gracefully handle missing data files:
try:
_text_vectors = np.load("data/text.npy", mmap_mode="r")
_text_vectors_unique = np.unique(_text_vectors, axis=0)
_text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)].tolist()
_using_real_vectors = True
except FileNotFoundError:
print("Warning: data/text.npy not found. Using randomly generated vectors instead.")
_text_vectors_clean = [] # Empty list as fallback
_using_real_vectors = False
try:
_query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
_query_vectors_unique = np.unique(_query_vectors, axis=0)
_query_vectors_clean = _query_vectors_unique.tolist()
except FileNotFoundError:
print("Warning: data/queries.npy not found. Using randomly generated vectors instead.")
_query_vectors_clean = [] # Empty list as fallback
Then modify the functions to check _using_real_vectors
and fall back to random generation when needed:
def sample_queries(n: int) -> list[np.array]:
+ if not _using_real_vectors or len(_query_vectors_clean) == 0:
+ # Fallback to random generation
+ return [np.random.random(384).round(3).tolist() for _ in range(n)]
sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
return [_query_vectors_clean[i] for i in sampled_vectors]
This would make the tests more robust and still allow them to run without the data files.
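An alternative to silently falling back on random vectors is to make the data dependency explicit and skip the affected tests when the file is absent. A hedged sketch — the paths, the `pytest` skip marker placement, and the helper name are assumptions about the repository layout, not its actual code:

```python
import os

import numpy as np
import pytest

# Hypothetical path resolution; adjust to the actual repository layout.
DATA_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data", "text.npy")

# Skip the whole module when the vectors file is absent, instead of
# silently switching to random data.
pytestmark = pytest.mark.skipif(
    not os.path.exists(DATA_PATH),
    reason="data/text.npy not found; fetch the test vectors first",
)


def load_clean_vectors(path: str = DATA_PATH) -> list[list[float]]:
    """Load vectors, drop duplicate rows, and drop rows containing NaN."""
    vectors = np.load(path, mmap_mode="r")
    unique = np.unique(vectors, axis=0)
    return unique[~np.isnan(unique).any(axis=1)].tolist()
```

The explicit skip keeps CI output readable ("skipped: data missing") instead of masking a data-provisioning problem behind randomly generated vectors.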
Also applies to: 18-24, 65-68, 124-124
🧰 Tools
🪛 GitHub Actions: Integration tests
[error] 13-13: FileNotFoundError: No such file or directory: 'data/text.npy'. The test failed because the required data file 'data/text.npy' is missing.
Fix the file path reference issue
The code is attempting to load data files with hardcoded paths, but the pipeline has failed with a FileNotFoundError: No such file or directory: 'data/text.npy'
. This indicates that either:
- The data files are missing from the repository, or
- The path is incorrect relative to where the tests are executed
Consider one of these solutions:
- Include the data files in the PR or document how to obtain them
- Use a relative path that accounts for the project structure:
- _text_vectors = np.load("data/text.npy", mmap_mode="r")
+ _text_vectors = np.load(os.path.join(os.path.dirname(__file__), "../../data/text.npy"), mmap_mode="r")
Don't forget to add the import os
statement at the top of the file.
📝 Committable suggestion
```python
import os

_text_vectors = np.load(
    os.path.join(os.path.dirname(__file__), "../../data/text.npy"),
    mmap_mode="r"
)
_text_vectors_unique = np.unique(_text_vectors, axis=0)
_text_vectors_clean = _text_vectors_unique[
    ~np.isnan(_text_vectors_unique).any(axis=1)
].tolist()
```
🧰 Tools
🪛 GitHub Actions: Integration tests
[error] 13-13: FileNotFoundError: No such file or directory: 'data/text.npy'. The test failed because the required data file 'data/text.npy' is missing.
tests/fixtures/points.py
Outdated
```python
def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
    sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
    return [_text_vectors_clean[i] for i in sampled_vectors]
```
|
🛠️ Refactor suggestion
Add dimension validation to sample_random_multivector
The sample_random_multivector
function doesn't validate that the dimensions of the preloaded vectors match the requested vec_size
.
Add dimension validation:
def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
+ # Verify vector dimensions match
+ for i in sampled_vectors:
+ if len(_text_vectors_clean[i]) != vec_size:
+ raise ValueError(f"Preloaded vector dimension {len(_text_vectors_clean[i])} does not match requested dimension {vec_size}")
return [_text_vectors_clean[i] for i in sampled_vectors]
This will help catch dimension mismatches early.
📝 Committable suggestion
```python
def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
    sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
    # Verify vector dimensions match
    for i in sampled_vectors:
        if len(_text_vectors_clean[i]) != vec_size:
            raise ValueError(
                f"Preloaded vector dimension {len(_text_vectors_clean[i])} "
                f"does not match requested dimension {vec_size}"
            )
    return [_text_vectors_clean[i] for i in sampled_vectors]
```
tests/fixtures/points.py
Outdated
```python
def sample_queries(n: int) -> list[np.array]:
    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
    _query_vectors_unique = np.unique(_query_vectors, axis=0)
    _query_vectors = _query_vectors_unique.tolist()
    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
    return [_query_vectors[i] for i in sampled_vectors]
```
|
🛠️ Refactor suggestion
Improve data loading efficiency and error handling
The sample_queries
function loads the data from disk on every call, which is inefficient. Additionally, it lacks error handling for the file loading operation.
Consider loading the query vectors once at module level, similar to how you handle _text_vectors
:
+ _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
+ _query_vectors_unique = np.unique(_query_vectors, axis=0)
+ _query_vectors_clean = _query_vectors_unique.tolist()
def sample_queries(n: int) -> list[np.array]:
- _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
- _query_vectors_unique = np.unique(_query_vectors, axis=0)
- _query_vectors = _query_vectors_unique.tolist()
- sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
- return [_query_vectors[i] for i in sampled_vectors]
+ sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+ return [_query_vectors_clean[i] for i in sampled_vectors]
Also, add exception handling to provide better error messages if the files can't be loaded.
📝 Committable suggestion
```python
# Module-level loading & preprocessing of query vectors
_query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
_query_vectors_unique = np.unique(_query_vectors, axis=0)
_query_vectors_clean = _query_vectors_unique.tolist()


def sample_queries(n: int) -> list[np.array]:
    sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
    return [_query_vectors_clean[i] for i in sampled_vectors]
```
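A middle ground between per-call loading and import-time loading is a lazily cached loader: nothing is read at import time (so a missing file cannot break test collection), and the file is read and deduplicated only once on first use. A sketch under the same assumed file layout — the function names are illustrative:

```python
from functools import lru_cache

import numpy as np


@lru_cache(maxsize=None)
def get_query_vectors(path: str = "data/queries.npy") -> tuple[tuple[float, ...], ...]:
    """Load, deduplicate, and cache the query vectors on first access."""
    vectors = np.load(path, allow_pickle=True).astype(np.float32)
    unique = np.unique(vectors, axis=0)
    # Tuples keep the cached value immutable, so callers cannot corrupt the cache.
    return tuple(tuple(row) for row in unique.tolist())


def sample_queries(n: int) -> list[list[float]]:
    pool = get_query_vectors()
    sampled = np.random.choice(len(pool), size=n, replace=False)
    return [list(pool[i]) for i in sampled]
```

Because `lru_cache` keys on the arguments, tests can also point the loader at alternative fixture files without touching module-level state.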
Actionable comments posted: 0
♻️ Duplicate comments (3)
tests/fixtures/points.py (3)
15-17
: Fix the file path reference and add error handling

Your code is attempting to load data files with hardcoded paths, but this approach is fragile and may fail in different environments. The absence of error handling will cause tests to fail unexpectedly when files are missing.

Run this script to check the file access issue:
Run this script to check the file access issue:
#!/bin/bash # Check if the data files exist in the expected locations echo "Current directory: $(pwd)" echo "Checking for data files:" ls -la data/ 2>/dev/null || echo "data/ directory not found" echo "Searching for .npy files:" find . -name "*.npy" -type f
20-27
: Improve data loading efficiency insample_queries
The function loads the data file on every call, which is inefficient. Similar to how you handle
_text_vectors
, consider loading the query vectors once at module level.
69-73
: Add dimension validation tosample_random_multivector
The function doesn't validate that the dimensions of the preloaded vectors match the requested
vec_size
, which could lead to dimension mismatches during testing.
🧹 Nitpick comments (4)
tests/fixtures/points.py (4)
148-148
: Fix unused loop variableThe loop control variable
name
is not used within the loop body.

```diff
-    for name, vec in vectors.items():
+    for _name, vec in vectors.items():
```

🧰 Tools
🪛 Ruff (0.11.9)
148-148: Loop control variable
name
not used within loop bodyRename unused
name
to_name
(B007)
152-152
: Fix unused loop variableThe loop control variable
name
is not used within the loop body.- for name, vec in enumerate(vectors): + for _name, vec in enumerate(vectors):🧰 Tools
🪛 Ruff (0.11.9)
152-152: Loop control variable
name
not used within loop bodyRename unused
name
to_name
(B007)
30-35
: Add type annotation for theidx
parameterThe
idx
parameter lacks a type annotation, making it unclear what type is expected.-def random_vectors(vector_sizes: Union[dict[str, int], int], idx=None) -> models.VectorStruct: +def random_vectors(vector_sizes: Union[dict[str, int], int], idx: Optional[int] = None) -> models.VectorStruct:Don't forget to add
from typing import Optional
at the top of the file.
130-132
: Consider adding a docstring to describe vector samplingThe code samples vectors from preloaded data rather than generating them randomly, which is a significant change in behavior. Consider adding a docstring to explain this new approach.
def generate_points( num_points: int, vector_sizes: Union[dict[str, int], int], with_payload: bool = False, random_ids: bool = False, skip_vectors: bool = False, sparse: bool = False, even_sparse: bool = True, multivector: bool = False, ) -> list[models.PointStruct]: + """ + Generate points with vectors for testing. + + When not using sparse or multivector options, this function samples vectors from + preloaded data rather than generating them randomly, ensuring more realistic test data. + """ if skip_vectors and isinstance(vector_sizes, int): raise ValueError("skip_vectors is not supported for single vector")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `tests/congruence_tests/test_collections.py` (2 hunks)
- `tests/congruence_tests/test_common.py` (1 hunks)
- `tests/embed_tests/test_local_inference.py` (1 hunks)
- `tests/fixtures/points.py` (5 hunks)
- `tests/test_fastembed.py` (3 hunks)
✅ Files skipped from review due to trivial changes (2)
- tests/test_fastembed.py
- tests/embed_tests/test_local_inference.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/congruence_tests/test_collections.py
- tests/congruence_tests/test_common.py
🧰 Additional context used
🪛 Ruff (0.11.9)
tests/fixtures/points.py
148-148: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)

152-152: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Redirect rules - poetic-froyo-8baba7
- GitHub Check: Header rules - poetic-froyo-8baba7
- GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (1)
tests/fixtures/points.py (1)
`147-154`: **Approve NaN and type assertions**

Good addition of assertions to validate that vectors don't contain NaN values and have the appropriate numeric types. This helps catch data quality issues early in testing.

🧰 Tools
🪛 Ruff (0.11.9)

148-148: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)

152-152: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)
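The approved assertions can be exercised in isolation. The helper below is illustrative only (not the fixture's actual code); it shows what the `dtype.kind` and `np.isnan` checks catch:

```python
import numpy as np


def validate_vector(vec) -> None:
    """Mirror the reviewed assertions: numeric dtype and no NaN values."""
    arr = np.array(vec)
    assert arr.dtype.kind in ("f", "i"), f"unexpected dtype {arr.dtype}"  # float or int
    assert not np.isnan(arr).any(), "vector contains NaN"


validate_vector([0.1, 0.2, 0.3])  # float vector: passes
validate_vector([1, 2, 3])        # int vector: passes
```

A string vector (dtype kind `"U"`) fails the first assertion before `np.isnan` is ever reached, so the checks also reject non-numeric data, not just NaN contamination.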
Actionable comments posted: 1
♻️ Duplicate comments (4)
tests/fixtures/points.py (4)
`77-81`: **Add dimension validation to `sample_random_multivector`**

The `sample_random_multivector` function doesn't validate that the dimensions of the preloaded vectors match the requested `vec_size`.

Add dimension validation:

```diff
 def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
     doc_vectors = _text_vectors_clean.copy()
+    # Verify vector dimensions match
+    if len(doc_vectors) > 0 and len(doc_vectors[0]) != vec_size:
+        raise ValueError(f"Preloaded vector dimension {len(doc_vectors[0])} does not match requested dimension {vec_size}")
     sampled_vectors = np.random.choice(len(doc_vectors), size=vec_count, replace=False)
     return [np.array(doc_vectors[i]).astype(np.float32).tolist() for i in sampled_vectors]
```
`16-23`: ⚠️ **Potential issue: Fix the file path reference**

The code is attempting to load data files with hardcoded paths, but the pipeline has failed with a `FileNotFoundError: No such file or directory: 'data/text.npy'`. This indicates that either:

- The data files are missing from the repository, or
- The path is incorrect relative to where the tests are executed

Consider one of these solutions:

1. Include the data files in the PR or document how to obtain them
2. Use a relative path that accounts for the project structure:

```diff
- _text_vectors = np.load("data/text.npy", mmap_mode="r")[..., :text_vector_size]
+ _text_vectors = np.load(os.path.join(os.path.dirname(__file__), "../../data/text.npy"), mmap_mode="r")[..., :text_vector_size]
```

Don't forget to add the `import os` statement at the top of the file.

🧰 Tools
🪛 GitHub Actions: Integration tests

[error] 17-17: FileNotFoundError: No such file or directory: 'data/text.npy' - required data file missing causing test collection failure.
`25-25`: 🛠️ **Refactor suggestion: Add fallback mechanism for missing data files**

Currently, if the data files are missing, the test will fail completely. Consider adding a fallback mechanism that uses randomly generated vectors when the data files are not found.

Implement a try-except block to gracefully handle missing data files:

```diff
-_text_vectors_clean = generate_vectors()
+try:
+    _text_vectors_clean = generate_vectors()
+    _using_real_vectors = True
+except FileNotFoundError:
+    print("Warning: data/text.npy not found. Using randomly generated vectors instead.")
+    # Generate fallback random vectors
+    _text_vectors_clean = [np.random.random(text_vector_size).tolist() for _ in range(100)]
+    _using_real_vectors = False
```
`28-36`: 🛠️ **Refactor suggestion: Improve data loading efficiency and error handling**

The `sample_queries` function loads the data from disk on every call, which is inefficient. Additionally, it lacks error handling for the file loading operation.

Consider loading the query vectors once at module level, similar to how you handle `_text_vectors_clean`:

```diff
+try:
+    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)[
+        ..., :text_vector_size
+    ]
+    _query_vectors_unique = np.unique(_query_vectors, axis=0)
+    _query_vectors_clean = _query_vectors_unique.tolist()
+except FileNotFoundError:
+    print("Warning: data/queries.npy not found. Using randomly generated vectors instead.")
+    _query_vectors_clean = [np.random.random(text_vector_size).tolist() for _ in range(100)]

 def sample_queries(n: int) -> list[np.array]:
-    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)[
-        ..., :text_vector_size
-    ]
-    _query_vectors_unique = np.unique(_query_vectors, axis=0)
-    _query_vectors = _query_vectors_unique.tolist()
-    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
-    return [_query_vectors[i].copy() for i in sampled_vectors]
+    sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+    return [_query_vectors_clean[i].copy() for i in sampled_vectors]
```
🧹 Nitpick comments (2)
tests/fixtures/points.py (2)
`156-159`: **Fix unused loop variable**

The loop control variable `name` is not used within the loop body.

```diff
-for name, vec in vectors.items():
+for _name, vec in vectors.items():
     assert np.array(vec).dtype.kind in ("f", "i")  # float or int
     assert not np.isnan(np.array(vec)).any()
```

🧰 Tools
🪛 Ruff (0.11.9)

156-156: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)
`160-163`: **Fix unused loop variable**

The loop control variable `name` is not used within the loop body.

```diff
-for name, vec in enumerate(vectors):
+for _name, vec in enumerate(vectors):
     assert np.array(vec).dtype.kind in ("f", "i")  # float or int
     assert not np.isnan(np.array(vec)).any()
```

🧰 Tools
🪛 Ruff (0.11.9)

160-160: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `tests/congruence_tests/test_search.py` (1 hunks)
- `tests/fixtures/points.py` (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
tests/congruence_tests/test_search.py (1)

- tests/fixtures/points.py (1): `generate_vectors` (16-22)

tests/fixtures/points.py (1)

- qdrant_client/http/models/models.py (1): `SparseVector` (2706-2712)
🪛 Ruff (0.11.9)
tests/fixtures/points.py
156-156: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)

160-160: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)
🪛 GitHub Actions: Integration tests
tests/fixtures/points.py
[error] 17-17: FileNotFoundError: No such file or directory: 'data/text.npy' - required data file missing causing test collection failure.
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Redirect rules - poetic-froyo-8baba7
- GitHub Check: Header rules - poetic-froyo-8baba7
- GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (3)
tests/fixtures/points.py (2)
`13-13`: **Good addition of a centralized vector size constant.**

This makes the vector size consistent across tests, improving maintainability.

`38-51`: **LGTM! Good implementation of optional index parameter.**

The modification allows for reusing preloaded vectors when an index is provided, which improves consistency across tests.
tests/congruence_tests/test_search.py (1)
`19-19`: **LGTM! Good import addition.**

The import of `generate_vectors` from `tests.fixtures.points` aligns with the goal of standardizing vector generation across tests.
The code under review (tests/congruence_tests/test_search.py, lines 26-31):

```python
_text_vectors = generate_vectors()
sampled_vectors = np.random.choice(len(_text_vectors), size=3, replace=False)

self.query_text = _text_vectors[sampled_vectors[0]]
self.query_image = _text_vectors[sampled_vectors[1]]
self.query_code = _text_vectors[sampled_vectors[2]]
```
💡 Verification agent
🧩 Analysis chain
Improve test reliability by handling potential missing data files
The test now depends on external data files, which might not be available in all environments. Consider adding error handling for when `generate_vectors()` fails.
This dependency on external data files could make the tests fail in environments where the data files aren't present. Make sure this is compatible with the test strategy for the project.
🏁 Script executed:

```shell
#!/bin/bash
# Check if the test is trying to access the data files
grep -r "data/text.npy" --include="*.py" .
grep -r "data/queries.npy" --include="*.py" .
# Check if these files exist in the repository
fd "text.npy" .
fd "queries.npy" .
```

Length of output: 349
Handle missing data files in tests/fixtures/points.py
The helper in tests/fixtures/points.py loads `np.load("data/text.npy", …)` and `np.load("data/queries.npy", …)`, but neither file is bundled in the repo. In environments without those .npy files, generate_vectors() will raise FileNotFoundError and break the tests. To improve reliability, please choose one of the following:
- Wrap each np.load in a try/except that catches FileNotFoundError and calls pytest.skip(…) with a clear message.
- Add sample .npy files under tests/fixtures/data/ (and update the paths accordingly).
- Mock or patch np.load in your test suite so that generate_vectors() doesn’t depend on external files.
🤖 Prompt for AI Agents
In tests/congruence_tests/test_search.py around lines 26 to 31, the call to
generate_vectors() depends on external .npy data files that may be missing in
some environments, causing the test to fail. To fix this, add error handling
around generate_vectors() to catch FileNotFoundError and call pytest.skip() with
a clear message explaining the missing data files. This will gracefully skip the
test when the required files are not present, improving test reliability.
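The skip-based option from the suggestions above could look like the following sketch; the helper name and call site are hypothetical, and only `np.load` and `pytest.skip` are real library calls:

```python
import numpy as np
import pytest


def load_text_vectors(path: str):
    """Hypothetical loader: skip the calling test when the optional data file is absent."""
    try:
        return np.load(path, mmap_mode="r")
    except FileNotFoundError:
        pytest.skip(f"required data file {path!r} is missing")
```

Called from inside a test function, this either returns the memory-mapped array or marks that single test as skipped with a clear reason, instead of failing the whole collection phase.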
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/fixtures/points.py (2)
`16-23`: ⚠️ **Potential issue: Hard-coded data path still crashes the test run – add robust fallback logic**

`generate_vectors()` will raise `FileNotFoundError` at import time whenever `data/queries_clean.npy` is absent (see current CI failure). This exact problem was flagged in the previous review but remains unresolved.

```diff
-def generate_vectors():
-    _text_vectors = np.load("data/queries_clean.npy", mmap_mode="r")[..., :text_vector_size]
-    ...
-    return _text_vectors_clean
+def generate_vectors() -> list[list[float]]:
+    """Load & sanitise vectors, falling back to an empty list when the file is missing."""
+    data_path = os.path.join(os.path.dirname(__file__), "../../data/queries_clean.npy")
+    try:
+        _text_vectors = np.load(data_path, mmap_mode="r")[..., :text_vector_size]
+    except FileNotFoundError:
+        print(f"⚠ {data_path} not found – falling back to random vectors.")
+        return []
+
+    _text_vectors_unique = np.unique(_text_vectors, axis=0)
+    _text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)]
+    return _text_vectors_clean.tolist()
```

Don't forget to `import os` at the top of the file.

🧰 Tools
🪛 GitHub Actions: Integration tests

[error] 17-17: FileNotFoundError: No such file or directory: 'data/queries_clean.npy' - required numpy data file missing causing test collection failure.
`28-35`: 🛠️ ⚠️ **Potential issue: Repeated disk-loads, path issues and missing-file crashes in `sample_queries`**

1. The vector file name (`text_clean.npy`) conflicts with the `"queries_clean.npy"` used above—please double-check the intent.
2. The file is loaded every time `sample_queries` is called, which is wasteful.
3. No error handling means the entire test suite still explodes when the file is missing.

Proposed refactor (loads once, shares the same robust fallback pattern):

```diff
-def sample_queries(n: int) -> list[np.array]:
-    _query_vectors = np.load("data/text_clean.npy", allow_pickle=True).astype(np.float32)[..., :text_vector_size]
-    _query_vectors_unique = np.unique(_query_vectors, axis=0)
-    _query_vectors = _query_vectors_unique.tolist()
-    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
-    return [_query_vectors[i].copy() for i in sampled_vectors]
+try:
+    _query_vectors_clean: list[list[float]] = (
+        np.load(os.path.join(os.path.dirname(__file__), "../../data/text_clean.npy"), allow_pickle=True)
+        .astype(np.float32)[..., :text_vector_size]
+        .tolist()
+    )
+except FileNotFoundError:
+    _query_vectors_clean = []
+
+def sample_queries(n: int) -> list[np.array]:
+    if not _query_vectors_clean:
+        return [np.random.random(text_vector_size).astype(np.float32).tolist() for _ in range(n)]
+
+    sampled = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+    return [_query_vectors_clean[i].copy() for i in sampled]
```

This removes the hot-path I/O and keeps tests runnable without external data.
🧹 Nitpick comments (2)
tests/fixtures/points.py (2)
`70-74`: **Dead code: `generate_random_multivector` is no longer referenced**

The helper is never called after migrating to `sample_random_multivector`. Consider deleting it to keep the fixture lean.
`158-165`: **Rename unused loop variables to appease Ruff and improve clarity**

Both loops iterate only for the side-effects of assertions; the key/index is unused.

```diff
- for name, vec in vectors.items():
+ for _unused, vec in vectors.items():
  ...
- for name, vec in enumerate(vectors):
+ for _idx, vec in enumerate(vectors):
```

🧰 Tools
🪛 Ruff (0.11.9)

159-159: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)

163-163: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `tests/congruence_tests/test_multivector_recommend_queries.py` (4 hunks)
- `tests/fixtures/points.py` (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/congruence_tests/test_multivector_recommend_queries.py (1)

- qdrant_client/http/models/models.py (1): `RecommendStrategy` (2056-2066)
🪛 Ruff (0.11.9)
tests/fixtures/points.py
159-159: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)

163-163: Loop control variable `name` not used within loop body. Rename unused `name` to `_name` (B007)
🪛 GitHub Actions: Integration tests
tests/fixtures/points.py
[error] 17-17: FileNotFoundError: No such file or directory: 'data/queries_clean.npy' - required numpy data file missing causing test collection failure.
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Redirect rules - poetic-froyo-8baba7
- GitHub Check: Header rules - poetic-froyo-8baba7
- GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (5)
tests/congruence_tests/test_multivector_recommend_queries.py (5)
`157-157`: **Expanded negative examples list for more comprehensive testing**

This change adds an additional negative example ID (14) to the `only_negatives_best_score_recommend_euclid` method, which aligns with the PR objective of enhancing tests with more realistic vector configurations. Using more example points provides a more comprehensive test for the best score recommendation strategy.
`171-171`: **Enhanced positive examples for sum_scores recommendation testing**

The addition of a third positive example (30) to the `sum_scores_recommend` method provides a more robust test case for the sum scores recommendation strategy. This change is consistent with the PR's goal of using real vectors for testing and helps ensure the recommendation algorithm works correctly with multiple input vectors.
`205-207`: **Expanded negative examples list for sum_scores recommendation**

This change adds a third negative example ID (14) to the `only_negatives_sum_scores_recommend` method, consistent with the other modifications in this PR. Using three negative examples instead of two provides a more thorough test of the sum_scores recommendation strategy with negative-only inputs.
`224-224`: **Added third negative example to sum_scores recommendation with Euclidean distance**

The addition of a third negative example ID (14) to the `only_negatives_sum_scores_recommend_euclid` method maintains consistency with the other test methods and provides more comprehensive testing for the sum_scores strategy using Euclidean distance.
`271-319`: ✅ **Verification successful: Verify test consistency with expanded vector examples**

The main test function `test_simple_recommend()` has remained unchanged, which is good as it ensures the testing workflow remains consistent. However, it would be beneficial to verify that the expanded number of example vectors (adding ID 14 to negative examples and ID 30 to positive examples) doesn't unexpectedly alter test outcomes.
🏁 Script executed:

```shell
#!/bin/bash
# Check if the added vector IDs (14, 30) are valid and exist in the test fixtures
# Search for references to the generation of test fixtures
rg -A 5 "generate_multivector_fixtures" tests/
```

Length of output: 15913
🏁 Script executed:

```shell
#!/bin/bash
# Display the full implementation of generate_multivector_fixtures for ID assignment logic
rg -n "def generate_multivector_fixtures" -A200 tests/congruence_tests/test_common.py
```

Length of output: 9301
**No action needed: fixture ID expansion won't affect consistency checks**

We've verified that `generate_multivector_fixtures` simply delegates to `generate_points` with sequential IDs (default `random_ids=False`), so adding more vectors only extends the ID range without altering existing points. All assertions in `test_simple_recommend()` compare local vs. remote client behavior (not specific IDs), so expanding the fixture set will not change test outcomes.
The change under review (tests/fixtures/points.py):

```diff
 def random_vectors(vector_sizes: Union[dict[str, int], int], idx=None) -> models.VectorStruct:
     if isinstance(vector_sizes, int):
-        return np.random.random(vector_sizes).round(3).tolist()
+        if idx:
+            return _text_vectors_clean[idx].copy()
+        else:
+            return np.random.random(vector_sizes).tolist()  # .round(3)
     elif isinstance(vector_sizes, dict):
         vectors = {}
         for vector_name, vector_size in vector_sizes.items():
-            vectors[vector_name] = np.random.random(vector_size).round(3).tolist()
+            if idx:
+                vectors[vector_name] = _text_vectors_clean[idx].copy()
+            else:
+                vectors[vector_name] = np.random.random(vector_size).tolist()  # .round(3)
```
**`idx` falsy check causes silent mis-sampling when `idx == 0`**

`if idx:` treats `0` as "not supplied", so the very first sampled index falls back to a random vector instead of the expected pre-loaded one.

```diff
- if idx:
+ if idx is not None:
      return _text_vectors_clean[idx].copy()
```

Apply the same fix inside the dictionary branch (lines `47-48`).
🤖 Prompt for AI Agents
In tests/fixtures/points.py around lines 38 to 50, the condition checking if idx
is supplied uses a falsy check (if idx:), which incorrectly treats idx == 0 as
not supplied, causing silent mis-sampling. Change the condition to explicitly
check for None (if idx is not None:) in both the integer and dictionary branches
to correctly handle idx == 0 and ensure the pre-loaded vector is returned when
idx is zero.
Using real vectors for tests proof-of-concept